Derek Ruthardt
ReneWind Project
Problem Statement¶
Business Context¶
Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
Objective¶
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning, and it has collected sensor data on generator failures of wind turbines. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies between companies). The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
- False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
- False positives (FP) are detections where there is no failure. These will result in inspection costs.
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable should be considered a “failure” and “0” represents “no failure”.
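To make the cost ordering concrete, here is a minimal sketch with hypothetical unit costs; the actual costs are not given in the data, and the numbers below are made up only to respect the stated ordering inspection < repair < replacement:

```python
# Hypothetical unit costs -- only their ordering (inspection < repair < replacement)
# is given by the problem statement; the numbers themselves are made up.
INSPECTION_COST = 1    # incurred per false positive
REPAIR_COST = 4        # incurred per true positive
REPLACEMENT_COST = 10  # incurred per false negative

def maintenance_cost(tp, fp, fn):
    """Total maintenance cost implied by a model's confusion-matrix counts."""
    return tp * REPAIR_COST + fp * INSPECTION_COST + fn * REPLACEMENT_COST

# A model that catches more of 100 real failures (higher recall) is cheaper
# overall, even though it raises more false alarms:
print(maintenance_cost(tp=90, fp=30, fn=10))  # → 490
print(maintenance_cost(tp=50, fp=5, fn=50))   # → 705
```

This is why the modeling below prioritizes recall: every false negative converts a cheap repair into an expensive replacement.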
Data Description¶
- The data provided is a transformed version of original data which was collected using sensors.
- Train.csv - To be used for training and tuning of models.
- Test.csv - To be used only for testing the performance of the final best model.
- Both datasets consist of 40 predictor variables and 1 target variable.
Importing necessary libraries¶
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
Mounting my Google Drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loading the dataset¶
#Loading the training data
data = pd.read_csv("/content/drive/MyDrive/Train.csv.csv")
#Loading the test data
test = pd.read_csv("/content/drive/MyDrive/Test.csv.csv")
#Make a copy of training set
df = data.copy()
#Make a copy of the test set
df_test = test.copy()
Data Overview¶
- Observations
- Sanity checks
#Shape of the df (training set copy)
df.shape
(20000, 41)
The training set has 20000 rows and 41 columns
#Shape of the test set copy
df_test.shape
(5000, 41)
The test set has 5000 rows and 41 columns
df.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.465 | -4.679 | 3.102 | 0.506 | -0.221 | -2.033 | -2.911 | 0.051 | -1.522 | 3.762 | -5.715 | 0.736 | 0.981 | 1.418 | -3.376 | -3.047 | 0.306 | 2.914 | 2.270 | 4.395 | -2.388 | 0.646 | -1.191 | 3.133 | 0.665 | -2.511 | -0.037 | 0.726 | -3.982 | -1.073 | 1.667 | 3.060 | -1.690 | 2.846 | 2.235 | 6.667 | 0.444 | -2.369 | 2.951 | -3.480 | 0 |
| 1 | 3.366 | 3.653 | 0.910 | -1.368 | 0.332 | 2.359 | 0.733 | -4.332 | 0.566 | -0.101 | 1.914 | -0.951 | -1.255 | -2.707 | 0.193 | -4.769 | -2.205 | 0.908 | 0.757 | -5.834 | -3.065 | 1.597 | -1.757 | 1.766 | -0.267 | 3.625 | 1.500 | -0.586 | 0.783 | -0.201 | 0.025 | -1.795 | 3.033 | -2.468 | 1.895 | -2.298 | -1.731 | 5.909 | -0.386 | 0.616 | 0 |
| 2 | -3.832 | -5.824 | 0.634 | -2.419 | -1.774 | 1.017 | -2.099 | -3.173 | -2.082 | 5.393 | -0.771 | 1.107 | 1.144 | 0.943 | -3.164 | -4.248 | -4.039 | 3.689 | 3.311 | 1.059 | -2.143 | 1.650 | -1.661 | 1.680 | -0.451 | -4.551 | 3.739 | 1.134 | -2.034 | 0.841 | -1.600 | -0.257 | 0.804 | 4.086 | 2.292 | 5.361 | 0.352 | 2.940 | 3.839 | -4.309 | 0 |
| 3 | 1.618 | 1.888 | 7.046 | -1.147 | 0.083 | -1.530 | 0.207 | -2.494 | 0.345 | 2.119 | -3.053 | 0.460 | 2.705 | -0.636 | -0.454 | -3.174 | -3.404 | -1.282 | 1.582 | -1.952 | -3.517 | -1.206 | -5.628 | -1.818 | 2.124 | 5.295 | 4.748 | -2.309 | -3.963 | -6.029 | 4.949 | -3.584 | -2.577 | 1.364 | 0.623 | 5.550 | -1.527 | 0.139 | 3.101 | -1.277 | 0 |
| 4 | -0.111 | 3.872 | -3.758 | -2.983 | 3.793 | 0.545 | 0.205 | 4.849 | -1.855 | -6.220 | 1.998 | 4.724 | 0.709 | -1.989 | -2.633 | 4.184 | 2.245 | 3.734 | -6.313 | -5.380 | -0.887 | 2.062 | 9.446 | 4.490 | -3.945 | 4.582 | -8.780 | -3.383 | 5.107 | 6.788 | 2.044 | 8.266 | 6.629 | -10.069 | 1.223 | -3.230 | 1.687 | -2.164 | -3.645 | 6.510 | 0 |
df.tail()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19995 | -2.071 | -1.088 | -0.796 | -3.012 | -2.288 | 2.807 | 0.481 | 0.105 | -0.587 | -2.899 | 8.868 | 1.717 | 1.358 | -1.777 | 0.710 | 4.945 | -3.100 | -1.199 | -1.085 | -0.365 | 3.131 | -3.948 | -3.578 | -8.139 | -1.937 | -1.328 | -0.403 | -1.735 | 9.996 | 6.955 | -3.938 | -8.274 | 5.745 | 0.589 | -0.650 | -3.043 | 2.216 | 0.609 | 0.178 | 2.928 | 1 |
| 19996 | 2.890 | 2.483 | 5.644 | 0.937 | -1.381 | 0.412 | -1.593 | -5.762 | 2.150 | 0.272 | -2.095 | -1.526 | 0.072 | -3.540 | -2.762 | -10.632 | -0.495 | 1.720 | 3.872 | -1.210 | -8.222 | 2.121 | -5.492 | 1.452 | 1.450 | 3.685 | 1.077 | -0.384 | -0.839 | -0.748 | -1.089 | -4.159 | 1.181 | -0.742 | 5.369 | -0.693 | -1.669 | 3.660 | 0.820 | -1.987 | 0 |
| 19997 | -3.897 | -3.942 | -0.351 | -2.417 | 1.108 | -1.528 | -3.520 | 2.055 | -0.234 | -0.358 | -3.782 | 2.180 | 6.112 | 1.985 | -8.330 | -1.639 | -0.915 | 5.672 | -3.924 | 2.133 | -4.502 | 2.777 | 5.728 | 1.620 | -1.700 | -0.042 | -2.923 | -2.760 | -2.254 | 2.552 | 0.982 | 7.112 | 1.476 | -3.954 | 1.856 | 5.029 | 2.083 | -6.409 | 1.477 | -0.874 | 0 |
| 19998 | -3.187 | -10.052 | 5.696 | -4.370 | -5.355 | -1.873 | -3.947 | 0.679 | -2.389 | 5.457 | 1.583 | 3.571 | 9.227 | 2.554 | -7.039 | -0.994 | -9.665 | 1.155 | 3.877 | 3.524 | -7.015 | -0.132 | -3.446 | -4.801 | -0.876 | -3.812 | 5.422 | -3.732 | 0.609 | 5.256 | 1.915 | 0.403 | 3.164 | 3.752 | 8.530 | 8.451 | 0.204 | -7.130 | 4.249 | -6.112 | 0 |
| 19999 | -2.687 | 1.961 | 6.137 | 2.600 | 2.657 | -4.291 | -2.344 | 0.974 | -1.027 | 0.497 | -9.589 | 3.177 | 1.055 | -1.416 | -4.669 | -5.405 | 3.720 | 2.893 | 2.329 | 1.458 | -6.429 | 1.818 | 0.806 | 7.786 | 0.331 | 5.257 | -4.867 | -0.819 | -5.667 | -2.861 | 4.674 | 6.621 | -1.989 | -1.349 | 3.952 | 5.450 | -0.455 | -2.202 | 1.678 | -1.974 | 0 |
df_test.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613 | -3.820 | 2.202 | 1.300 | -1.185 | -4.496 | -1.836 | 4.723 | 1.206 | -0.342 | -5.123 | 1.017 | 4.819 | 3.269 | -2.984 | 1.387 | 2.032 | -0.512 | -1.023 | 7.339 | -2.242 | 0.155 | 2.054 | -2.772 | 1.851 | -1.789 | -0.277 | -1.255 | -3.833 | -1.505 | 1.587 | 2.291 | -5.411 | 0.870 | 0.574 | 4.157 | 1.428 | -10.511 | 0.455 | -1.448 | 0 |
| 1 | 0.390 | -0.512 | 0.527 | -2.577 | -1.017 | 2.235 | -0.441 | -4.406 | -0.333 | 1.967 | 1.797 | 0.410 | 0.638 | -1.390 | -1.883 | -5.018 | -3.827 | 2.418 | 1.762 | -3.242 | -3.193 | 1.857 | -1.708 | 0.633 | -0.588 | 0.084 | 3.014 | -0.182 | 0.224 | 0.865 | -1.782 | -2.475 | 2.494 | 0.315 | 2.059 | 0.684 | -0.485 | 5.128 | 1.721 | -1.488 | 0 |
| 2 | -0.875 | -0.641 | 4.084 | -1.590 | 0.526 | -1.958 | -0.695 | 1.347 | -1.732 | 0.466 | -4.928 | 3.565 | -0.449 | -0.656 | -0.167 | -1.630 | 2.292 | 2.396 | 0.601 | 1.794 | -2.120 | 0.482 | -0.841 | 1.790 | 1.874 | 0.364 | -0.169 | -0.484 | -2.119 | -2.157 | 2.907 | -1.319 | -2.997 | 0.460 | 0.620 | 5.632 | 1.324 | -1.752 | 1.808 | 1.676 | 0 |
| 3 | 0.238 | 1.459 | 4.015 | 2.534 | 1.197 | -3.117 | -0.924 | 0.269 | 1.322 | 0.702 | -5.578 | -0.851 | 2.591 | 0.767 | -2.391 | -2.342 | 0.572 | -0.934 | 0.509 | 1.211 | -3.260 | 0.105 | -0.659 | 1.498 | 1.100 | 4.143 | -0.248 | -1.137 | -5.356 | -4.546 | 3.809 | 3.518 | -3.074 | -0.284 | 0.955 | 3.029 | -1.367 | -3.412 | 0.906 | -2.451 | 0 |
| 4 | 5.828 | 2.768 | -1.235 | 2.809 | -1.642 | -1.407 | 0.569 | 0.965 | 1.918 | -2.775 | -0.530 | 1.375 | -0.651 | -1.679 | -0.379 | -4.443 | 3.894 | -0.608 | 2.945 | 0.367 | -5.789 | 4.598 | 4.450 | 3.225 | 0.397 | 0.248 | -2.362 | 1.079 | -0.473 | 2.243 | -3.591 | 1.774 | -1.502 | -2.227 | 4.777 | -6.560 | -0.806 | -0.276 | -3.858 | -0.538 | 0 |
df_test.tail()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | -5.120 | 1.635 | 1.251 | 4.036 | 3.291 | -2.932 | -1.329 | 1.754 | -2.985 | 1.249 | -6.878 | 3.715 | -2.512 | -1.395 | -2.554 | -2.197 | 4.772 | 2.403 | 3.792 | 0.487 | -2.028 | 1.778 | 3.668 | 11.375 | -1.977 | 2.252 | -7.319 | 1.907 | -3.734 | -0.012 | 2.120 | 9.979 | 0.063 | 0.217 | 3.036 | 2.109 | -0.557 | 1.939 | 0.513 | -2.694 | 0 |
| 4996 | -5.172 | 1.172 | 1.579 | 1.220 | 2.530 | -0.669 | -2.618 | -2.001 | 0.634 | -0.579 | -3.671 | 0.460 | 3.321 | -1.075 | -7.113 | -4.356 | -0.001 | 3.698 | -0.846 | -0.222 | -3.645 | 0.736 | 0.926 | 3.278 | -2.277 | 4.458 | -4.543 | -1.348 | -1.779 | 0.352 | -0.214 | 4.424 | 2.604 | -2.152 | 0.917 | 2.157 | 0.467 | 0.470 | 2.197 | -2.377 | 0 |
| 4997 | -1.114 | -0.404 | -1.765 | -5.879 | 3.572 | 3.711 | -2.483 | -0.308 | -0.922 | -2.999 | -0.112 | -1.977 | -1.623 | -0.945 | -2.735 | -0.813 | 0.610 | 8.149 | -9.199 | -3.872 | -0.296 | 1.468 | 2.884 | 2.792 | -1.136 | 1.198 | -4.342 | -2.869 | 4.124 | 4.197 | 3.471 | 3.792 | 7.482 | -10.061 | -0.387 | 1.849 | 1.818 | -1.246 | -1.261 | 7.475 | 0 |
| 4998 | -1.703 | 0.615 | 6.221 | -0.104 | 0.956 | -3.279 | -1.634 | -0.104 | 1.388 | -1.066 | -7.970 | 2.262 | 3.134 | -0.486 | -3.498 | -4.562 | 3.136 | 2.536 | -0.792 | 4.398 | -4.073 | -0.038 | -2.371 | -1.542 | 2.908 | 3.215 | -0.169 | -1.541 | -4.724 | -5.525 | 1.668 | -4.100 | -5.949 | 0.550 | -1.574 | 6.824 | 2.139 | -4.036 | 3.436 | 0.579 | 0 |
| 4999 | -0.604 | 0.960 | -0.721 | 8.230 | -1.816 | -2.276 | -2.575 | -1.041 | 4.130 | -2.731 | -3.292 | -1.674 | 0.465 | -1.646 | -5.263 | -7.988 | 6.480 | 0.226 | 4.963 | 6.752 | -6.306 | 3.271 | 1.897 | 3.271 | -0.637 | -0.925 | -6.759 | 2.990 | -0.814 | 3.499 | -8.435 | 2.370 | -1.062 | 0.791 | 4.952 | -7.441 | -0.070 | -0.918 | -2.291 | -5.363 | 0 |
#Info for df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns): V1–V40 (float64) and Target (int64)
V1 and V2: 19982 non-null each; all other columns: 20000 non-null
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
#Info for test set
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 41 columns): V1–V40 (float64) and Target (int64)
V1: 4995 non-null, V2: 4994 non-null; all other columns: 5000 non-null
dtypes: float64(40), int64(1)
memory usage: 1.6 MB
All variables in the training and test sets are numerical.
We will check for duplicates and missing values.
#Check df for duplicates
df.duplicated().sum()
0
#Check test data for duplicates
df_test.duplicated().sum()
0
There are no duplicates in the training df or the test set.
We will now look for null or missing values in the training and test sets.
#Check df for null values
df.isnull().sum()
| 0 | |
|---|---|
| V1 | 18 |
| V2 | 18 |
| V3 | 0 |
| V4 | 0 |
| V5 | 0 |
| V6 | 0 |
| V7 | 0 |
| V8 | 0 |
| V9 | 0 |
| V10 | 0 |
| V11 | 0 |
| V12 | 0 |
| V13 | 0 |
| V14 | 0 |
| V15 | 0 |
| V16 | 0 |
| V17 | 0 |
| V18 | 0 |
| V19 | 0 |
| V20 | 0 |
| V21 | 0 |
| V22 | 0 |
| V23 | 0 |
| V24 | 0 |
| V25 | 0 |
| V26 | 0 |
| V27 | 0 |
| V28 | 0 |
| V29 | 0 |
| V30 | 0 |
| V31 | 0 |
| V32 | 0 |
| V33 | 0 |
| V34 | 0 |
| V35 | 0 |
| V36 | 0 |
| V37 | 0 |
| V38 | 0 |
| V39 | 0 |
| V40 | 0 |
| Target | 0 |
The training set is missing 18 values in each of the V1 and V2 columns.
#Check test data for null values
df_test.isnull().sum()
| 0 | |
|---|---|
| V1 | 5 |
| V2 | 6 |
| V3 | 0 |
| V4 | 0 |
| V5 | 0 |
| V6 | 0 |
| V7 | 0 |
| V8 | 0 |
| V9 | 0 |
| V10 | 0 |
| V11 | 0 |
| V12 | 0 |
| V13 | 0 |
| V14 | 0 |
| V15 | 0 |
| V16 | 0 |
| V17 | 0 |
| V18 | 0 |
| V19 | 0 |
| V20 | 0 |
| V21 | 0 |
| V22 | 0 |
| V23 | 0 |
| V24 | 0 |
| V25 | 0 |
| V26 | 0 |
| V27 | 0 |
| V28 | 0 |
| V29 | 0 |
| V30 | 0 |
| V31 | 0 |
| V32 | 0 |
| V33 | 0 |
| V34 | 0 |
| V35 | 0 |
| V36 | 0 |
| V37 | 0 |
| V38 | 0 |
| V39 | 0 |
| V40 | 0 |
| Target | 0 |
The test set is also missing a few values in the V1 and V2 columns (5 and 6, respectively). We will treat these missing values later, before using the data.
Looking at the statistical summary for the data
#Stat summary for df
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| V1 | 19982.000 | -0.272 | 3.442 | -11.876 | -2.737 | -0.748 | 1.840 | 15.493 |
| V2 | 19982.000 | 0.440 | 3.151 | -12.320 | -1.641 | 0.472 | 2.544 | 13.089 |
| V3 | 20000.000 | 2.485 | 3.389 | -10.708 | 0.207 | 2.256 | 4.566 | 17.091 |
| V4 | 20000.000 | -0.083 | 3.432 | -15.082 | -2.348 | -0.135 | 2.131 | 13.236 |
| V5 | 20000.000 | -0.054 | 2.105 | -8.603 | -1.536 | -0.102 | 1.340 | 8.134 |
| V6 | 20000.000 | -0.995 | 2.041 | -10.227 | -2.347 | -1.001 | 0.380 | 6.976 |
| V7 | 20000.000 | -0.879 | 1.762 | -7.950 | -2.031 | -0.917 | 0.224 | 8.006 |
| V8 | 20000.000 | -0.548 | 3.296 | -15.658 | -2.643 | -0.389 | 1.723 | 11.679 |
| V9 | 20000.000 | -0.017 | 2.161 | -8.596 | -1.495 | -0.068 | 1.409 | 8.138 |
| V10 | 20000.000 | -0.013 | 2.193 | -9.854 | -1.411 | 0.101 | 1.477 | 8.108 |
| V11 | 20000.000 | -1.895 | 3.124 | -14.832 | -3.922 | -1.921 | 0.119 | 11.826 |
| V12 | 20000.000 | 1.605 | 2.930 | -12.948 | -0.397 | 1.508 | 3.571 | 15.081 |
| V13 | 20000.000 | 1.580 | 2.875 | -13.228 | -0.224 | 1.637 | 3.460 | 15.420 |
| V14 | 20000.000 | -0.951 | 1.790 | -7.739 | -2.171 | -0.957 | 0.271 | 5.671 |
| V15 | 20000.000 | -2.415 | 3.355 | -16.417 | -4.415 | -2.383 | -0.359 | 12.246 |
| V16 | 20000.000 | -2.925 | 4.222 | -20.374 | -5.634 | -2.683 | -0.095 | 13.583 |
| V17 | 20000.000 | -0.134 | 3.345 | -14.091 | -2.216 | -0.015 | 2.069 | 16.756 |
| V18 | 20000.000 | 1.189 | 2.592 | -11.644 | -0.404 | 0.883 | 2.572 | 13.180 |
| V19 | 20000.000 | 1.182 | 3.397 | -13.492 | -1.050 | 1.279 | 3.493 | 13.238 |
| V20 | 20000.000 | 0.024 | 3.669 | -13.923 | -2.433 | 0.033 | 2.512 | 16.052 |
| V21 | 20000.000 | -3.611 | 3.568 | -17.956 | -5.930 | -3.533 | -1.266 | 13.840 |
| V22 | 20000.000 | 0.952 | 1.652 | -10.122 | -0.118 | 0.975 | 2.026 | 7.410 |
| V23 | 20000.000 | -0.366 | 4.032 | -14.866 | -3.099 | -0.262 | 2.452 | 14.459 |
| V24 | 20000.000 | 1.134 | 3.912 | -16.387 | -1.468 | 0.969 | 3.546 | 17.163 |
| V25 | 20000.000 | -0.002 | 2.017 | -8.228 | -1.365 | 0.025 | 1.397 | 8.223 |
| V26 | 20000.000 | 1.874 | 3.435 | -11.834 | -0.338 | 1.951 | 4.130 | 16.836 |
| V27 | 20000.000 | -0.612 | 4.369 | -14.905 | -3.652 | -0.885 | 2.189 | 17.560 |
| V28 | 20000.000 | -0.883 | 1.918 | -9.269 | -2.171 | -0.891 | 0.376 | 6.528 |
| V29 | 20000.000 | -0.986 | 2.684 | -12.579 | -2.787 | -1.176 | 0.630 | 10.722 |
| V30 | 20000.000 | -0.016 | 3.005 | -14.796 | -1.867 | 0.184 | 2.036 | 12.506 |
| V31 | 20000.000 | 0.487 | 3.461 | -13.723 | -1.818 | 0.490 | 2.731 | 17.255 |
| V32 | 20000.000 | 0.304 | 5.500 | -19.877 | -3.420 | 0.052 | 3.762 | 23.633 |
| V33 | 20000.000 | 0.050 | 3.575 | -16.898 | -2.243 | -0.066 | 2.255 | 16.692 |
| V34 | 20000.000 | -0.463 | 3.184 | -17.985 | -2.137 | -0.255 | 1.437 | 14.358 |
| V35 | 20000.000 | 2.230 | 2.937 | -15.350 | 0.336 | 2.099 | 4.064 | 15.291 |
| V36 | 20000.000 | 1.515 | 3.801 | -14.833 | -0.944 | 1.567 | 3.984 | 19.330 |
| V37 | 20000.000 | 0.011 | 1.788 | -5.478 | -1.256 | -0.128 | 1.176 | 7.467 |
| V38 | 20000.000 | -0.344 | 3.948 | -17.375 | -2.988 | -0.317 | 2.279 | 15.290 |
| V39 | 20000.000 | 0.891 | 1.753 | -6.439 | -0.272 | 0.919 | 2.058 | 7.760 |
| V40 | 20000.000 | -0.876 | 3.012 | -11.024 | -2.940 | -0.921 | 1.120 | 10.654 |
| Target | 20000.000 | 0.056 | 0.229 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
#Find the value count of the target variable in df
df["Target"].value_counts()
| count | |
|---|---|
| Target | |
| 0 | 18890 |
| 1 | 1110 |
The data has 1110 failures and 18890 non-failures represented.
We can see the data is heavily imbalanced toward the 0 class. We will address this later when training our models.
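The imbalance can be quantified directly from the counts above (values copied from the `value_counts()` output):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 18890, 1: 1110}, name="Target")

failure_rate = counts[1] / counts.sum()  # share of failures
imbalance_ratio = counts[0] / counts[1]  # non-failures per failure
print(f"failure rate: {failure_rate:.2%}, about {imbalance_ratio:.0f} non-failures per failure")
```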
#Find the value count of target in the test data
df_test["Target"].value_counts()
| count | |
|---|---|
| Target | |
| 0 | 4718 |
| 1 | 282 |
The test set shows 282 failures and 4718 non-failures, a ratio similar to the training set.
Exploratory Data Analysis (EDA)¶
Plotting histograms and boxplots for all the variables¶
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot with a star indicating the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
Plotting all the features at one go¶
for feature in df.columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None)
Observations:
The target variable is heavily weighted to non-failures.
There is a slight difference between the mean and median for the features below:
Right skewed: V1, V3, V18, V27, V29, V32, V37
Left skewed: V8, V10, V16, V30, V34
All other variables are fairly normally distributed.
We will look at how these distributions compare to the test set.
for feature in df_test.columns:
    histogram_boxplot(df_test, feature, figsize=(12, 7), kde=False, bins=None)
The test data distributions are similar to the observations we see with the train data. It looks like the two datasets represent each other well!
Multivariate Analysis¶
#heatmap for df
plt.figure(figsize=(20, 20))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
#heatmap for df_test
plt.figure(figsize=(20, 20))
sns.heatmap(df_test.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
We can see the training set and the test set share similar correlations between variables.
We can see evidence of strong positive and negative correlations between predictor variables. However, no single variable is strongly correlated with the target.
V15, V18, and V21 show the strongest relationships with the target relative to the other features.
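That ranking can also be computed directly. Since the ciphered data is not reproduced here, the sketch below uses a toy frame in which two features (named V18 and V21 after the observation above) carry signal and one column is pure noise:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ciphered sensor data: V18 and V21 carry signal, "noise" does not
rng = np.random.default_rng(1)
n = 1000
target = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "V18": -2.0 * target + rng.normal(size=n),
    "V21": 1.5 * target + rng.normal(size=n),
    "noise": rng.normal(size=n),
    "Target": target,
})

# Rank features by absolute correlation with the target
corr_with_target = toy.corr()["Target"].drop("Target").abs().sort_values(ascending=False)
print(corr_with_target)  # V18 and V21 rank above the pure-noise column
```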
# Displot with a hue on Target variable
for feature in df.columns:
    sns.displot(df, x=feature, hue="Target")
V18 - When V18 is less than -4, there is a strong incidence of failures, and a low representation of the machinery not failing once it reaches this range.
V21 - When V21 is greater than 5, there is a strong incidence of failure, with a low "survival" rate for the machinery in this range.
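Such threshold observations can be quantified by comparing the failure rate inside the region with the overall rate. A tiny made-up sample (the real df is not reproduced here) illustrates the computation:

```python
import pandas as pd

# Made-up sample: three rows fall in the V18 < -4 region
toy = pd.DataFrame({
    "V18":    [-6.1, -5.2, -4.8, -0.3, 1.2, 2.5, 0.7, -1.9],
    "Target": [   1,    1,    0,    0,   0,   0,   1,    0],
})

overall_rate = toy["Target"].mean()                      # 3/8
region_rate = toy.loc[toy["V18"] < -4, "Target"].mean()  # 2/3
print(f"overall: {overall_rate:.3f}, within V18 < -4: {region_rate:.3f}")
```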
# Detecting outliers in the data using boxplots
cols_list = df.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 45))
for i, variable in enumerate(cols_list):
    plt.subplot(15, 3, i + 1)
    sns.boxplot(data=df, x=variable)
    plt.title(variable)
plt.tight_layout()
plt.show()
There are outliers throughout the dataset. We will keep these for analysis.
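For reference, the outliers flagged by the boxplot whiskers follow the 1.5×IQR rule; a small made-up series shows the computation:

```python
import pandas as pd

# 1.5*IQR rule used by boxplot whiskers, on a made-up column with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(int(mask.sum()), "outlier(s):", s[mask].tolist())  # → 1 outlier(s): [40]
```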
Data Pre-processing¶
# Dividing train data into X and y
X = df.drop(["Target"], axis=1)
y = df["Target"]
# Dividing test data into X_test and y_test
X_test = df_test.drop(["Target"], axis=1)
y_test = df_test["Target"]
# Splitting data into training and validation set:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_val.shape, X_test.shape)
(15000, 40) (5000, 40) (5000, 40)
The training data is split 75–25% into train and validation sets.
The test set is untouched and retained as is.
Missing value imputation¶
We will impute missing values after the split, fitting the imputer only on the training split, so that no information leaks from the validation or test data into training.
# Let's impute the missing values in columns V1 and V2
imputer = SimpleImputer(strategy="median")
cols_to_impute = ["V1", "V2"]
# fit and transform the imputer on train data
X_train[cols_to_impute] = imputer.fit_transform(X_train[cols_to_impute])
# Transform on validation and test data
X_val[cols_to_impute] = imputer.transform(X_val[cols_to_impute])
# Transform on test data
X_test[cols_to_impute] = imputer.transform(X_test[cols_to_impute])
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
V1–V40: 0 missing values (X_train)
------------------------------
V1–V40: 0 missing values (X_val)
------------------------------
V1–V40: 0 missing values (X_test)
There is no missing data in the train, validation, and test data.
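An alternative to the manual fit/transform calls above is to wrap the imputer in a `Pipeline` (already imported earlier); the imputer is then fit only on whatever data is passed to `.fit()`, so leakage cannot happen by accident. A sketch on toy arrays standing in for the real X/y:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy arrays standing in for X_train / y_train
X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [2.0, 1.0]])
y_toy = np.array([0, 1, 0, 1])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression()),
])
pipe.fit(X_toy, y_toy)  # imputer medians are learned from X_toy only
print(pipe.named_steps["imputer"].statistics_)  # column medians used for filling
```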
Model Building¶
Model evaluation criterion¶
The nature of predictions made by the classification model will translate as follows:
- True positives (TP) are failures correctly predicted by the model.
- False negatives (FN) are real failures in a generator that are not detected by the model.
- False positives (FP) are failure detections in a generator where there is no actual failure.
Which metric to optimize?
- We need to choose the metric that ensures the maximum number of generator failures are predicted correctly by the model.
- We want to maximize Recall: the greater the Recall, the lower the number of false negatives.
- We want to minimize false negatives because if the model predicts no failure when a failure will actually occur, maintenance costs increase (a replacement instead of a repair).
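As a quick sanity check of the metric itself (toy labels, not the project data): recall = TP / (TP + FN), so every missed failure pulls it down:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0]  # three real failures
y_pred = [1, 0, 1, 0, 1]  # one failure missed (FN), one false alarm (FP)
print(recall_score(y_true, y_pred))  # → 2/3 = 0.666...
```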
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Defining scorer to be used for cross-validation and hyperparameter tuning¶
- We want to reduce false negatives and will try to maximize "Recall".
- To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Model Building with original data¶
Building and cross-validating a set of candidate models with the original data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Score:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores_val = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_val))
Cross-Validation Score:

Bagging: 0.7210807301060529
Random forest: 0.7235192266070268
GBM: 0.7066661857008874
Adaboost: 0.6309140754635308
Xgboost: 0.8100497799581561
dtree: 0.6982829521679532
Logistic Regression: 0.4927566553639709

Validation Performance:

Bagging: 0.7302158273381295
Random forest: 0.7266187050359713
GBM: 0.7230215827338129
Adaboost: 0.6762589928057554
Xgboost: 0.8309352517985612
dtree: 0.7050359712230215
Logistic Regression: 0.4856115107913669
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
XGBoost performs best (0.83 recall on the validation set), followed by Bagging and Random Forest (both at about 0.73 recall on the validation set).
Model Building with Oversampled data¶
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label '1': 832
Before OverSampling, counts of label '0': 14168

After OverSampling, counts of label '1': 14168
After OverSampling, counts of label '0': 14168

After OverSampling, the shape of train_X: (28336, 40)
After OverSampling, the shape of train_y: (28336,)
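For intuition, SMOTE builds each synthetic minority sample by interpolating between an existing minority point and one of its k nearest minority neighbors. A stdlib-only sketch of that interpolation step (the two points below are hypothetical, and the real implementation also picks the neighbor at random):

```python
import random

random.seed(1)

# Two hypothetical minority-class points that are near neighbors.
x = [2.0, 3.0]
neighbor = [4.0, 5.0]

# SMOTE-style interpolation: new point = x + gap * (neighbor - x), gap in [0, 1],
# so the synthetic sample lies on the segment between the two points.
gap = random.random()
synthetic = [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]
print(synthetic)
```

Because every synthetic point lies between two real minority samples, SMOTE enlarges the minority class without simply duplicating rows, which is why the oversampled counts above match the majority class exactly.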
models1 = [] # Empty list to store all the models created with over sampling
# Appending models into the list
models1.append(("Bagging with Over Sampling", BaggingClassifier(random_state=1)))
models1.append(
("Random forest with Over Sampling", RandomForestClassifier(random_state=1))
)
models1.append(("GBM with Over Sampling", GradientBoostingClassifier(random_state=1)))
models1.append(("Adaboost with Over Sampling", AdaBoostClassifier(random_state=1)))
models1.append(
("Xgboost with Over Sampling", XGBClassifier(random_state=1, eval_metric="logloss"))
)
models1.append(("dtree with Over Sampling", DecisionTreeClassifier(random_state=1)))
models1.append(
("Logistic Regression with Over Sampling", LogisticRegression(random_state=1))
)
results1 = [] # Empty list to store all model's CV scores
names1 = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Score:" "\n")
for name, model in models1:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result1 = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
)
results1.append(cv_result1)
names1.append(name)
print("{}: {}".format(name, cv_result1.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models1:
model.fit(X_train_over, y_train_over)
scores1_val = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores1_val))
Cross-Validation Score:

Bagging with Over Sampling: 0.9762141471581656
Random forest with Over Sampling: 0.9839075260047615
GBM with Over Sampling: 0.9256068151319724
Adaboost with Over Sampling: 0.8978689011775473
Xgboost with Over Sampling: 0.9891305241357218
dtree with Over Sampling: 0.9720494245534969
Logistic Regression with Over Sampling: 0.883963699328486

Validation Performance:

Bagging with Over Sampling: 0.8345323741007195
Random forest with Over Sampling: 0.8489208633093526
GBM with Over Sampling: 0.8776978417266187
Adaboost with Over Sampling: 0.8561151079136691
Xgboost with Over Sampling: 0.8669064748201439
dtree with Over Sampling: 0.7769784172661871
Logistic Regression with Over Sampling: 0.8489208633093526
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results1)
ax.set_xticklabels(names1)
plt.xticks(rotation=60)
plt.show()
After oversampling, XGBoost performs best on cross-validation with a mean recall of 0.99.
Random Forest also does very well, with a 0.98 mean cross-validation recall.
Model Building with Undersampled data¶
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before UnderSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UnderSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After UnderSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After UnderSampling, the shape of train_X: {}".format(X_train_un.shape))
print("After UnderSampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before UnderSampling, counts of label '1': 832
Before UnderSampling, counts of label '0': 14168

After UnderSampling, counts of label '1': 832
After UnderSampling, counts of label '0': 832

After UnderSampling, the shape of train_X: (1664, 40)
After UnderSampling, the shape of train_y: (1664,)
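Random undersampling keeps every minority sample and draws an equally sized random subset of the majority class, which is why the training set shrinks to 2 x 832 = 1664 rows above. A stdlib-only sketch of the idea on toy data:

```python
import random

random.seed(1)

# Hypothetical dataset: (features, label) pairs with a 10:2 class imbalance.
majority = [([i, i + 1], 0) for i in range(10)]
minority = [([i, i - 1], 1) for i in range(2)]

# Keep every minority row; randomly sample the majority down to the minority size.
downsampled_majority = random.sample(majority, k=len(minority))
balanced = downsampled_majority + minority
random.shuffle(balanced)

print(len(balanced))  # 4 rows, 2 per class
```

The trade-off is visible in the shapes: the classes are balanced, but most majority-class rows are discarded, so models see far less data than with oversampling.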
models2 = [] # Empty list to store all the models created with under sampling
# Appending models into the list
models2.append(("Bagging with Under Sampling", BaggingClassifier(random_state=1)))
models2.append(
("Random forest with Under Sampling", RandomForestClassifier(random_state=1))
)
models2.append(("GBM with Under Sampling", GradientBoostingClassifier(random_state=1)))
models2.append(("Adaboost with Under Sampling", AdaBoostClassifier(random_state=1)))
models2.append(
(
"Xgboost with Under Sampling",
XGBClassifier(random_state=1, eval_metric="logloss"),
)
)
models2.append(("dtree with Under Sampling", DecisionTreeClassifier(random_state=1)))
models2.append(
("Logistic Regression with Under Sampling", LogisticRegression(random_state=1))
)
results2 = [] # Empty list to store all model's CV scores
names2 = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Score:" "\n")
for name, model in models2:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result2 = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
)
results2.append(cv_result2)
names2.append(name)
print("{}: {}".format(name, cv_result2.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models2:
model.fit(X_train_un, y_train_un)
scores2 = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores2))
Cross-Validation Score:

Bagging with Under Sampling: 0.8641945025611427
Random forest with Under Sampling: 0.9038669648654498
GBM with Under Sampling: 0.8978572974532861
Adaboost with Under Sampling: 0.8666113556020489
Xgboost with Under Sampling: 0.9014717552846114
dtree with Under Sampling: 0.8617776495202367
Logistic Regression with Under Sampling: 0.8726138085275232

Validation Performance:

Bagging with Under Sampling: 0.8705035971223022
Random forest with Under Sampling: 0.8920863309352518
GBM with Under Sampling: 0.8884892086330936
Adaboost with Under Sampling: 0.8489208633093526
Xgboost with Under Sampling: 0.89568345323741
dtree with Under Sampling: 0.841726618705036
Logistic Regression with Under Sampling: 0.8525179856115108
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results2)
ax.set_xticklabels(names2)
plt.xticks(rotation=60)
plt.show()
With undersampling, Random Forest has the highest mean cross-validation recall (0.904), with XGBoost a close second (0.901); on the validation set the order flips, with XGBoost at 0.896 and Random Forest at 0.892.
For hyperparameter tuning we will pick the top two performers from oversampling and undersampling; the models built on the original dataset do not perform nearly as well. The four models carried into hyperparameter tuning are:
- XGBoost with oversampling
- Random Forest with oversampling
- XGBoost with undersampling
- Random Forest with undersampling
Hyperparameter Tuning¶
Sample Parameter Grids¶
Hyperparameter tuning can take a long time to run, so to keep runtimes manageable you can use the following grids wherever required.
- For Gradient Boosting:
param_grid = {
    "n_estimators": np.arange(100, 150, 25),
    "learning_rate": [0.2, 0.05, 1],
    "subsample": [0.5, 0.7],
    "max_features": [0.5, 0.7],
}
- For Adaboost:
param_grid = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
- For Bagging Classifier:
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}
- For Random Forest:
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
- For Decision Trees:
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}
- For Logistic Regression:
param_grid = {"C": np.arange(0.1, 1.1, 0.1)}
- For XGBoost:
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
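For a sense of the runtime trade-off: GridSearchCV fits every combination in a grid (times the number of CV folds), while RandomizedSearchCV fits only n_iter sampled combinations. Counting the XGBoost grid above with the stdlib:

```python
from itertools import product

# The XGBoost grid from above.
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}

# Total number of parameter combinations a full grid search would fit.
n_combos = len(list(product(*param_grid.values())))
print(n_combos)  # 3 * 2 * 2 * 3 * 2 = 72
```

With cv=5, a full grid search over this grid means 72 x 5 = 360 fits; RandomizedSearchCV with n_iter=50 and cv=5 caps it at 250, at the cost of possibly missing the true best combination.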
Hyperparameter tuning for XGBoost with oversampled data using RandomizedSearchCV¶
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 250, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.9959768939564728:
CPU times: user 15.7 s, sys: 1.83 s, total: 17.5 s
Wall time: 9min 46s
# Creating new model with best parameters
xgb_tuned_over = XGBClassifier(
n_estimators=250, scale_pos_weight=10, learning_rate=0.1, gamma=3, subsample=0.8
)
xgb_tuned_over.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=3, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=250, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
xgb_tuned_over_train_perf = model_performance_classification_sklearn(
xgb_tuned_over, X_train_over, y_train_over
)
print("Train Performance:")
xgb_tuned_over_train_perf
Train Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997 | 1.000 | 0.993 | 0.997 |
xgb_tuned_over_val_perf = model_performance_classification_sklearn(xgb_tuned_over, X_val, y_val)
print("Validation Performance:")
xgb_tuned_over_val_perf
Validation Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.976 | 0.885 | 0.734 | 0.803 |
#Create a confusion matrix for xgb_tuned_over
confusion_matrix_sklearn(xgb_tuned_over, X_val, y_val)
The tuned XGBoost model with oversampling improved recall on the validation set from 0.867 (untuned) to 0.885.
With a training recall of 1.0, it does look to be overfitting the train data.
Hyperparameter tuning for Random Forest with oversampled data using RandomizedSearchCV¶
%%time
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the oversampled training data
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9815078165615898:
CPU times: user 1min 2s, sys: 1.81 s, total: 1min 3s
Wall time: 16min 7s
# Creating new model with best parameters
rf_tuned_over = RandomForestClassifier(
n_estimators=300, min_samples_leaf=1, max_features="sqrt", max_samples=0.6
)
rf_tuned_over.fit(X_train_over, y_train_over)
RandomForestClassifier(max_samples=0.6, n_estimators=300)
rf_tuned_over_train_perf = model_performance_classification_sklearn(
rf_tuned_over, X_train_over, y_train_over
)
print("Train Performance:")
rf_tuned_over_train_perf
Train Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.000 | 0.999 | 1.000 | 1.000 |
rf_tuned_over_val_perf = model_performance_classification_sklearn(
rf_tuned_over, X_val, y_val
)
print("Validation Performance:")
rf_tuned_over_val_perf
Validation Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.989 | 0.863 | 0.930 | 0.896 |
#Create confusion matrix for the rf_tuned_over
confusion_matrix_sklearn(rf_tuned_over, X_val, y_val)
The tuned random forest on oversampled data performs better than its untuned version, but its validation recall (0.863) is still below the tuned XGBoost's (0.885).
This tuned model also seems to overfit the training data.
Hyperparameter tuning for XGBoost with undersampled data using RandomizedSearchCV¶
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": [150, 200, 250],
    "scale_pos_weight": [5, 10],
    "learning_rate": [0.1, 0.2],
    "gamma": [0, 3, 5],
    "subsample": [0.8, 0.9],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=50,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting RandomizedSearchCV on the undersampled training data
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9314695909386048:
CPU times: user 3.44 s, sys: 305 ms, total: 3.75 s
Wall time: 2min 27s
# Creating new model with best parameters
xgb_tuned_un = XGBClassifier(
n_estimators=200, scale_pos_weight=10, learning_rate=0.1, gamma=5, subsample=0.8
)
xgb_tuned_un.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=5, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=200, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
xgb_tuned_un_train_perf = model_performance_classification_sklearn(
xgb_tuned_un, X_train_un, y_train_un
)
print("Train Performance:")
xgb_tuned_un_train_perf
Train Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.977 | 1.000 | 0.955 | 0.977 |
xgb_tuned_un_val_perf = model_performance_classification_sklearn(xgb_tuned_un, X_val, y_val)
print("Validation Performance:")
xgb_tuned_un_val_perf
Validation Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.834 | 0.917 | 0.240 | 0.381 |
#Create a confusion matrix for xgb_tuned_un
confusion_matrix_sklearn(xgb_tuned_un, X_val, y_val)
The tuned XGBoost on undersampled data improves validation recall to 0.92, but precision drops sharply (0.24), and its accuracy is lower than that of the tuned XGBoost on oversampled data.
This model is also overfitting the train data.
Hyperparameter tuning for Random Forest with undersampled data using GridSearchCV¶
%%time
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass to GridSearchCV
params = {
    "n_estimators": [200, 250, 300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=Model, param_grid=params, n_jobs=-1, scoring=scorer, cv=5)
# Fitting GridSearchCV on the undersampled training data
grid_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:".format(grid_cv.best_params_, grid_cv.best_score_))
Best parameters are {'max_features': 'sqrt', 'max_samples': 0.5, 'min_samples_leaf': 1, 'n_estimators': 250} with CV score=0.8990188298102592:
CPU times: user 3.45 s, sys: 304 ms, total: 3.76 s
Wall time: 2min 37s
# Creating new model with best parameters
rf_tuned_grid = RandomForestClassifier(
n_estimators=250, min_samples_leaf=1, max_features="sqrt", max_samples=0.5
)
rf_tuned_grid.fit(X_train_un, y_train_un)
RandomForestClassifier(max_samples=0.5, n_estimators=250)
rf_tuned_grid_train_perf = model_performance_classification_sklearn(
rf_tuned_grid, X_train_un, y_train_un
)
print("Train Performance:")
rf_tuned_grid_train_perf
Train Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.975 | 0.951 | 0.999 | 0.974 |
rf_tuned_grid_val_perf = model_performance_classification_sklearn(
rf_tuned_grid, X_val, y_val
)
print("Validation Performance:")
rf_tuned_grid_val_perf
Validation Performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.937 | 0.888 | 0.466 | 0.611 |
#Create a confusion matrix for rf_tuned_grid
confusion_matrix_sklearn(rf_tuned_grid, X_val, y_val)
The model has improved recall after tuning. We will look at a comparison of all model performances to choose the final model.
Model performance comparison and choosing the final model¶
# training performance comparison
models_train_comp_df = pd.concat(
[
xgb_tuned_over_train_perf.T,
xgb_tuned_un_train_perf.T,
rf_tuned_over_train_perf.T,
rf_tuned_grid_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"XGBoost tuned with oversampled data",
"XGBoost tuned with undersampled data",
"Random forest tuned with oversampled data",
"Random forest tuned with undersampled data-GridSearch CV",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| XGBoost tuned with oversampled data | XGBoost tuned with undersampled data | Random forest tuned with oversampled data | Random forest tuned with undersampled data-GridSearch CV | |
|---|---|---|---|---|
| Accuracy | 0.997 | 0.977 | 1.000 | 0.975 |
| Recall | 1.000 | 1.000 | 0.999 | 0.951 |
| Precision | 0.993 | 0.955 | 1.000 | 0.999 |
| F1 | 0.997 | 0.977 | 1.000 | 0.974 |
# validation performance comparison
models_val_comp_df = pd.concat(
[
xgb_tuned_over_val_perf.T,
xgb_tuned_un_val_perf.T,
rf_tuned_over_val_perf.T,
rf_tuned_grid_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"XGBoost tuned with oversampled data",
"XGBoost tuned with undersampled data",
"Random forest tuned with oversampled data",
"Random forest tuned with undersampled data-GridSearch CV",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| XGBoost tuned with oversampled data | XGBoost tuned with undersampled data | Random forest tuned with oversampled data | Random forest tuned with undersampled data-GridSearch CV | |
|---|---|---|---|---|
| Accuracy | 0.976 | 0.834 | 0.989 | 0.937 |
| Recall | 0.885 | 0.917 | 0.863 | 0.888 |
| Precision | 0.734 | 0.240 | 0.930 | 0.466 |
| F1 | 0.803 | 0.381 | 0.896 | 0.611 |
With recall as the top priority and our scorer for this problem, we choose XGBoost tuned on the undersampled data as the final model; it has the highest recall (0.917) on the validation set.
Test set final performance¶
# Calculating different metrics on the test set
xgb_tuned_un_test = model_performance_classification_sklearn(
xgb_tuned_un, X_test, y_test
)
print("Test performance:")
xgb_tuned_un_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.842 | 0.890 | 0.248 | 0.388 |
XGBoost tuned on undersampled data gives a recall of 0.89 on the test set, so it performs well at minimizing false negatives. Precision remains low (0.25), meaning many healthy generators will be flagged for inspection.
print(
pd.DataFrame(
xgb_tuned_un.feature_importances_, columns=["Imp"], index=X_train_un.columns
).sort_values(by="Imp", ascending=False)
)
      Imp
V36 0.108
V26 0.053
V18 0.044
V14 0.040
V16 0.036
V39 0.035
V21 0.029
V3  0.028
V10 0.028
V11 0.028
V15 0.027
V37 0.025
V8  0.023
V17 0.023
V32 0.023
V31 0.023
V22 0.022
V30 0.022
V1  0.022
V35 0.021
V12 0.021
V9  0.020
V34 0.019
V6  0.019
V40 0.019
V28 0.018
V13 0.018
V19 0.018
V23 0.018
V24 0.017
V33 0.017
V20 0.017
V27 0.016
V4  0.015
V38 0.015
V25 0.015
V2  0.015
V7  0.014
V29 0.014
V5  0.014
feature_names = X.columns
importances = xgb_tuned_un.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The most important feature identified is V36, followed by V26, V18, V14, V16, and V39.
Pipelines to build the final model¶
# Splitting the train data to Target and other variables
X1 = data.drop(columns="Target")
y1 = data["Target"]
# Splitting the test data to Target and other variables
X1_test = test.drop(columns="Target")
y1_test = test["Target"]
print(X1.shape, X1_test.shape)
(20000, 40) (5000, 40)
imputer = SimpleImputer(strategy="median")
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X1_undrsamp, y1_undrsamp = rus.fit_resample(X1, y1)
# Creating new pipeline with best parameters
final_model = Pipeline(
steps=[
("imputer", imputer),
(
"estimator",
XGBClassifier(
n_estimators=200,
scale_pos_weight=10,
learning_rate=0.1,
gamma=5,
subsample=0.8
)
),
],
)
# Fit the model on undersampled training data
final_model.fit(X1_undrsamp, y1_undrsamp)
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('estimator',
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=5, grow_policy=None,
importance_type=None,
interaction_constraints=None, learning_rate=0.1,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=200, n_jobs=None,
                               num_parallel_tree=None, random_state=None, ...))])
# Let's check the performance on train set
Model_train1 = model_performance_classification_sklearn(
final_model, X1_undrsamp, y1_undrsamp
)
Model_train1
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 1.000 | 0.969 | 0.984 |
# Let's check the performance on test set
Model_test1 = model_performance_classification_sklearn(final_model, X1_test, y1_test)
Model_test1
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.842 | 0.894 | 0.249 | 0.389 |
# creating confusion matrix
confusion_matrix_sklearn(final_model, X1_test, y1_test)
Reading the confusion matrix:
- 5.04% true positives (observed=1, predicted=1): the model predicted a failure and the generator actually failed -- repair costs.
- 15.24% false positives (observed=0, predicted=1): the model predicted a failure but the generator was fine -- inspection costs.
- 79.12% true negatives (observed=0, predicted=0): the model predicted no failure and there was none.
- 0.60% false negatives (observed=1, predicted=0): the model predicted no failure but the generator actually failed -- replacement costs.
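To see what these rates imply for maintenance spend, here is a toy expected-cost calculation; the relative costs below (inspection = 1, repair = 3, replacement = 9 units) are illustrative assumptions, not figures given in the problem:

```python
# Confusion-matrix rates from the final model on the test set.
rates = {"TP": 0.0504, "FP": 0.1524, "TN": 0.7912, "FN": 0.0060}

# Hypothetical relative costs (assumptions, not from the data):
cost_inspect, cost_repair, cost_replace = 1, 3, 9

# Expected cost per generator with the model:
# TPs are repaired, FPs are only inspected, FNs fail and need replacement.
with_model = (rates["TP"] * cost_repair
              + rates["FP"] * cost_inspect
              + rates["FN"] * cost_replace)

# Without any model, every actual failure (TP + FN) ends in replacement.
without_model = (rates["TP"] + rates["FN"]) * cost_replace

print(round(with_model, 4), round(without_model, 4))  # 0.3576 0.5076
```

Under these assumed costs, the model cuts the expected cost per generator by roughly 30% versus a no-model baseline in which every failure ends in replacement; the exact saving depends entirely on the true cost ratios.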
Business Insights and Conclusions¶
"It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair."
The goal is to reduce false negatives, since they represent unidentified failures, which incur a larger cost than repairs or inspections would.
XGBoost trained on undersampled data and tuned with hyperparameters is identified as the best model. It provides a high recall score, minimizing false negatives (which lead to very high replacement costs).
On unseen data, this model produced a recall of 0.894; the occurrence of false negatives is only 0.6%.
The most important features identified are V36, followed by V26, V18, V14, V16, and V39. ReneWind should put these sensors at the top of the list when monitoring data.